Abstract

This  paper  addresses  structural  change  in  a  linear  regression  model  with 
an  unknown  change  point.  The  change  point  is  estimated  by  maximizing  a 
sequence  of  Wald-type  statistics.  This  paper  focuses  on  the  convergence  rate 
of  the  estimator.  T-consistency  is  obtained.  Asymptotic  distributions  for 
the  estimated  change  point  and  the  estimated  regression  parameters  are  also 
considered.  The  analysis  applies  to  both  pure  and  partial  structural  changes. 
1      Introduction

In  a  regression  framework,  structural  change  may  be  considered  a  regression  equation 
with  a  shift  in  some  of  the  regression  coefficients.  This  paper  focuses  on  the  inference 
of  the  shift  point  when  it  is  unknown  and  must  be  estimated.  Estimation  of  the  shift 
point  is  based  on  Wald-type  statistics.  These  statistics  are  typically  used  for  testing 
the  existence  of  a  structural  change;  see  Chow  (1960)  for  known  shift  point  and 
Andrews (1993) for unknown shift point. When the shift point is not known, a
sequence  of  Wald  statistics  is  used  in  constructing  the  test.  This  sequence  results 
from  sequential  estimation  carried  out  for  every  possible  sample  split.  The  shift-point 
estimator  is  then  defined  as  the  point  where  the  Wald  statistic  achieves  its  global 
maximum.  For  ease  of  reference,  we  shall  call  this  estimator  the  W-estimator.  Under 
normality  with  independent  and  identically  distributed  errors,  the  Wald  statistic  is 
just the F-statistic. When a single parameter changes, the W-estimator is equivalent
to maximizing a sequence of t-statistics in absolute value.

The Wald-type statistics can be thought of as a measure of the distance between
the estimator of the pre-shift parameters and that of the post-shift parameters. Correctly
identifying the shift point tends to maximize this distance, and vice versa. One
may also regard the pre-shift sample as drawn from one distribution and the post-shift
sample as drawn from another, so that the Wald statistic can be viewed as a "between-sample
variance." Correctly classifying the observations into two samples will maximize the
"between-sample variance" relative to the "within-sample variance," as in the
classification framework. It is interesting to note, and straightforward to prove, that, when
the  Wald-statistics  are  computed  using  a  sequence  of  least  squares  estimators,  the 
W-estimator  is  equivalent  to  minimizing  the  sum  of  squared  residuals.  Furthermore, 
because  LM-type  statistics  and  LR  statistics  are  monotonic  transformations  of  Wald 
statistics  in  linear  models,  the  W-estimator  can  also  be  obtained  by  maximizing  a 
sequence  of  these  latter  statistics.  This  may  not  be  true  for  nonlinear  models  or  for 
estimation  methods  other  than  least  squares. 

There are at least two advantages to basing the estimation on Wald-type statistics.
First and most important, Wald statistics are particularly convenient for the
consistency  proof  given  later.  We  explore  the  structure  of  the  Wald  statistic  as  a 
distance  between  pre-shift  and  post-shift  estimated  parameters  and  are  able  to  show 
that  this  distance  is  globally  maximized  only  near  the  true  shift  point.  The  second 
advantage  is  the  computational  convenience.  Wald  statistics  are  easy  to  compute 
and  are  built  into  many  software  packages.  In  addition,  testing  for  a  parameter  shift 
and  estimating  the  shift  point  can  be  performed  concurrently.  Upon  rejecting  the 
null  hypothesis  of  no  change,  one  may  wish  to  estimate  a  shift  point  simply  from 
the  test  statistics.  Testing  for  a  change  using  Wald  statistics  is  studied  by  Hawkins 
(1987)  and  Andrews  (1993). 

Structural change in linear regressions was considered early on by Quandt (1958).
The problem has received considerable attention in the recent literature. Various
test statistics have been put forward. Well-known examples include the optimal
test of Andrews and Ploberger (1992) and the fluctuation test of Ploberger,
Krämer and Kontrus (1989), in addition to the Wald-type tests mentioned earlier.
Tests for change in nonstationary time series models have been developed by Perron
(1989), Banerjee, Lumsdaine and Stock (1992), Chu and White (1992), Hansen
(1992), and Vogelsang (1992), among others. Hackl and Westlund (1989) offer a
comprehensive bibliography covering the early development of the subject. The
estimation of structural change has also been examined by many researchers. A survey is
given by Krishnaiah and Miao (1988). Methods of estimation include maximum
likelihood (MLE), nonparametric, least squares (LS), least absolute deviation (LAD),
and Bayesian estimation, among others. MLE is studied the most, examples
including Hinkley (1970), Picard (1985) and Yao (1988). These authors deal with
i.i.d. or autoregressive models with a shift. Picard (1985) and Yao (1987) focus
on local shifts. Nonparametric estimation is considered by Duembgen (1991) for
two i.i.d. samples. LS is studied by Hawkins (1986) and Bai (1993a) and LAD by
Bai (1993b). Bayesian estimation is summarized in the monograph by Broemeling
and Tsurumi (1987) (also see Zivot and Phillips (1992)). Further references can
be found in Bai (1992). In this paper, we estimate the shift point by using
Wald-type statistics. These statistics are based on least squares estimation. Our aim is
to establish T-consistency for the proposed estimator. Despite the large body of
literature, T-consistency has not been obtained for linear regressions.

Furthermore,  we  consider  the  problem  in  the  context  of  partial  structural  change 
in  which  some  of  the  regression  parameters  hold  constant  throughout  the  sample. 
Thus  these  parameters  should  be  estimated  using  the  entire  sample  in  order  to  gain 
efficiency.  The  difficulty  of  the  consistency  proof  increases  dramatically  under  partial 
structural  change  in  contrast  to  pure  structural  change.  Bai  (1993a)  essentially 
considers  a  pure  change  problem  and  offers  a  simple  argument  of  consistency.  Despite 
the  increase  in  difficulty,  we  shall  work  with  a  partial  structural  change  model  because 
this  model  includes  pure  structural  change  as  a  special  case.  We  deal  with  the 
pure  and  partial  problems  in  a  unified  way  by  concentrating  out  the  unchanged 
parameters. The Wald-type statistics serve this purpose well and allow us to
focus on the shifted parameters. The least squares method is used to estimate the
pre-shift and post-shift parameters, and the Wald statistics are based on these estimators.
The  use  of  least  squares  estimation  enables  us  to  relax  the  assumption  of  a  known 
error  distribution  function  as  is  necessary  for  maximum  likelihood  estimation.  We 
also  allow  heterogeneous  and  dependent  disturbances.  The  design  of  the  regressors 
is  virtually  arbitrary  but  satisfies  some  standard  assumptions  under  least  squares 
estimation.  In  particular,  stochastic  regressors  and  time  trends  are  allowed.  In 
addition  to  the  T-consistency  of  the  shift  point  estimator,  root-T  consistency  for  the 
estimated  pre-shift  and  post-shift  parameters  is  also  established. 

The asymptotic distribution of the W-estimator is also considered in this paper. This
problem is studied for shifts with shrinking magnitudes. Under a shift with a fixed
magnitude, it is mathematically intractable to obtain the asymptotic distribution
except for some i.i.d. samples with a shift, as shown by Hinkley (1970). The
asymptotic distribution allows one to construct confidence intervals for the shift point. We
also derive the asymptotic distribution for the estimated regression parameters. We
show that the asymptotic distribution for the latter is normal and is the same as if
the shift point were known.

This paper is organized as follows. Section 2 specifies the model and the underlying
assumptions. Section 3 proves the T-consistency of the W-estimator. The root-T
consistency  and  asymptotic  normality  for  the  regression  parameters  are  also  derived 
in  this  section.  Section  4  examines  the  asymptotic  distribution  of  the  shift-point 
estimator.  Concluding  remarks  are  provided  in  Section  5.  Some  technical  matters 
are  collected  in  the  Appendix. 

2     Models  and  Assumptions 

Consider  the  following  linear  regression  with  a  structural  change: 

$$y_t = x_t'\beta + \varepsilon_t, \qquad t = 1, 2, \ldots, k_0, \tag{1}$$

$$y_t = x_t'\beta + z_t'\delta + \varepsilon_t, \qquad t = k_0 + 1, \ldots, T, \tag{2}$$

where $\{\varepsilon_t\}$ is a sequence of unobservable disturbances that are weakly dependent;
$x_t$ and $z_t$ are vectors of regressors ($x_t$ is $p \times 1$ and $z_t$ is $q \times 1$, where $q \le p$); $k_0$
is the unknown shift point; and $\beta$ and $\delta$ are unknown parameters. When the $\varepsilon_t$ are
uncorrelated over time, the regressors $x_t$ and $z_t$ may include lagged $y_t$. For dependent
disturbances, we shall assume the $\varepsilon_t$ are uncorrelated with $x_t$ and $z_t$. When $x_t = z_t$,
all parameters of the model shift at time $k_0$. When $z_t$ is a sub-vector of $x_t$, a partial
shift model is obtained [see Andrews (1993) for a discussion]. More generally, we
assume $z_t = R'x_t$ for some matrix $R$ with full column rank so that $z_t$ is a linear
transformation of $x_t$. The purpose is to estimate $\beta$, $\delta$, $k_0$, and $\sigma^2$ with a focus on $k_0$.

Let us introduce some further notation. Let $y = (y_1, \ldots, y_T)'$, $X = (x_1, x_2, \ldots, x_T)'$,
$\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T)'$, $X_1 = (x_1, x_2, \ldots, x_k, 0, \ldots, 0)'$, $X_2 = (0, \ldots, 0, x_{k+1}, \ldots, x_T)'$, and
$X_0 = (0, \ldots, 0, x_{k_0+1}, \ldots, x_T)'$. The matrices $X_1$ and $X_2$ depend on $k$, but this
dependence will not be displayed for notational succinctness. Furthermore, let
$Z_1 = X_1R$, $Z_2 = X_2R$, and $Z_0 = X_0R$. Equations (1) and (2) can then be rewritten
as

$$y = X\beta + Z_0\delta + \varepsilon. \tag{3}$$

We consider the problem of testing $\delta = 0$ based on the Wald test. Because $k_0$ is
unknown, the matrix $Z_0$ cannot be constructed. The strategy for testing $\delta = 0$ is to
obtain a sequence of Wald statistics by replacing $Z_0$ with $Z_2$ for $k = p, p+1, \ldots, T-p$
(recall that $Z_2$ depends on $k$), where $p$ is the dimension of $x_t$, and then to use as a test
statistic the maximum value of the sequence, or another functional of this sequence,
such as the mean value. For each fixed $k$, let $\hat\beta_k$ and $\hat\delta_k$ be the least squares estimators
of $\beta$ and $\delta$, respectively, obtained by regressing $y$ on $X$ and $Z_2$. The corresponding
Wald statistic is then

$$W_T(k) = \frac{\hat\delta_k'(Z_2'MZ_2)\hat\delta_k}{S(k)/(T-p-q)}, \qquad k = p, p+1, \ldots, T-p, \tag{4}$$

where $M = I - X(X'X)^{-1}X'$ and $S(k)$ is the sum of squared residuals. Note that $\hat\delta_k$
and $S(k)$ can be written as

$$\hat\delta_k = (Z_2'MZ_2)^{-1}Z_2'My,$$

$$S(k) = \sum_{t=1}^{T} \big(y_t - x_t'\hat\beta_k - z_t'\hat\delta_k I(t > k)\big)^2,$$

where $I(\cdot)$ is the indicator function. The test statistic for testing $\delta = 0$ is

$$W_T = \sup_{\epsilon T \le k \le (1-\epsilon)T} W_T(k), \tag{5}$$

where $\epsilon > 0$ is a small positive number. The limiting distribution of $W_T$ is
nonstandard; see Andrews (1993) for details.
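
The construction of the sequence of Wald statistics can be sketched numerically. The following is a minimal illustration, not part of the formal development: the simulated design, the parameter values, the trimming fraction 0.15, and the function name `wald_sequence` are all assumptions made for the example, and the variance estimate uses the degrees-of-freedom correction T - p - q (any positive constant would leave the location of the maximum unchanged).

```python
import numpy as np

def wald_sequence(y, X, R):
    """Return (ks, W): the Wald statistic W[i] for each candidate split ks[i]."""
    T, p = X.shape
    q = R.shape[1]
    ks = np.arange(p, T - p + 1)                # k = p, p+1, ..., T-p
    W = np.empty(len(ks))
    for i, k in enumerate(ks):
        X2 = X.copy()
        X2[:k] = 0.0                            # rows t <= k are set to zero
        Z2 = X2 @ R                             # z_t' = x_t'R for t > k
        D = np.hstack([X, Z2])                  # regress y on X and Z2
        coef = np.linalg.lstsq(D, y, rcond=None)[0]
        delta_hat = coef[p:]
        resid = y - D @ coef
        S_k = resid @ resid                     # sum of squared residuals S(k)
        MZ2 = Z2 - X @ np.linalg.lstsq(X, Z2, rcond=None)[0]   # M Z2
        numerator = delta_hat @ (MZ2.T @ MZ2) @ delta_hat
        W[i] = numerator / (S_k / (T - p - q))  # assumed normalization
    return ks, W

# Hypothetical data: an intercept shift of size 5 at k0 = 40 (partial change, q = 1).
rng = np.random.default_rng(0)
T, k0 = 80, 40
X = np.column_stack([np.ones(T), rng.normal(size=T)])
R = np.array([[1.0], [0.0]])
beta, delta = np.array([1.0, 0.5]), np.array([5.0])
X0 = X.copy()
X0[:k0] = 0.0
y = X @ beta + (X0 @ R) @ delta + 0.3 * rng.normal(size=T)

ks, W = wald_sequence(y, X, R)
k_hat = int(ks[np.argmax(W)])                   # maximizer over the full range
trim = 0.15                                     # illustrative trimming fraction
inside = (ks >= trim * T) & (ks <= (1 - trim) * T)
sup_W = float(W[inside].max())                  # trimmed sup statistic as in (5)
print(k_hat, sup_W)
```

The same sequence serves both purposes discussed in the text: its trimmed supremum is a test statistic, and the location of its maximum is a candidate shift-point estimate.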

Now assume $\delta \neq 0$. Our objective is to estimate the shift point $k_0$. We define
the shift-point estimator as the location where the maximum of the Wald statistic is
achieved, namely,

$$\hat k = \arg\max_{p \le k \le T-p} W_T(k),$$


where $p$ is the number of columns of $X$. We call $\hat k$ the W-estimator for ease of
reference. This approach to estimating a shift has been used in empirical
applications. Christiano (1992) looked for potential changes in U.S. GNP using F-statistics.
Note that the restriction $k \in [p, T-p]$ is needed to ensure the existence of the least squares
estimators. However, no restriction of the form $k \in [\epsilon T, (1-\epsilon)T]$ is required.

Although both the numerator and denominator of $W_T(k)$ in (4) depend on $k$, it
is not difficult to show that $\hat k$ can be obtained by maximizing the numerator alone
and can also be obtained by minimizing the denominator alone. We state this simple
result as a proposition.

Proposition 1

$$\hat k = \arg\max_k W_T(k) = \arg\max_k \hat\delta_k'(Z_2'MZ_2)\hat\delta_k = \arg\min_k S(k).$$

That is, the W-estimator obtained by maximizing Wald-type statistics is the same
as the estimator obtained by minimizing the sum of squared residuals. To see this, let $S$ denote the sum of
squared residuals under restricted estimation ($\delta = 0$); then the Wald statistic is

$$W_T(k) = (T-p-q)\,\frac{S - S(k)}{S(k)}.$$

Because $S$ does not depend on $k$ and the Wald statistic is a strictly decreasing
transformation of $S(k)$, it follows immediately that $\hat k = \arg\min_k S(k)$ or, equivalently,
$\hat k = \arg\max_k \{S - S(k)\}$. The proposition then follows from $S - S(k) = \hat\delta_k'(Z_2'MZ_2)\hat\delta_k$
[see Amemiya (1985, pp. 31-33)].
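
Proposition 1 is easy to check numerically. The sketch below is an illustration on simulated data (the design, parameter values, and seed are assumptions, as is the scale factor T - p - q in the Wald statistic): it computes the quadratic form through the projection matrix M and the sum of squared residuals through a separate least squares fit, and then compares the two.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k0, p = 60, 25, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
R = np.eye(p)                                   # pure change: z_t = x_t
q = R.shape[1]
beta, delta = np.array([1.0, -0.5]), np.array([2.0, 1.0])
X0 = X.copy()
X0[:k0] = 0.0
y = X @ beta + (X0 @ R) @ delta + rng.normal(size=T)

M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
S_restricted = y @ M @ y                        # SSR under delta = 0

ks = np.arange(p, T - p + 1)
V = np.empty(len(ks))                           # quadratic form (numerator)
S = np.empty(len(ks))                           # unrestricted SSR S(k)
for i, k in enumerate(ks):
    X2 = X.copy()
    X2[:k] = 0.0
    Z2 = X2 @ R
    A = Z2.T @ M @ Z2
    d = np.linalg.solve(A, Z2.T @ M @ y)        # delta_hat_k
    V[i] = d @ A @ d
    D = np.hstack([X, Z2])
    resid = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    S[i] = resid @ resid

W = (T - p - q) * (S_restricted - S) / S        # Wald statistic, assumed scaling

# S - S(k) equals the quadratic form, so all three criteria pick the same k.
print(np.allclose(S_restricted - S, V))
print(ks[np.argmax(W)], ks[np.argmax(V)], ks[np.argmin(S)])
```

Because the scale factor does not depend on k, replacing T - p - q by any other positive constant leaves all three maximizers unchanged.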

For a linear model with least squares estimation, LR and LM statistics are
monotonic (increasing) transformations of the Wald statistics; thus the W-estimator can also
be obtained by maximizing LR and LM statistics. Despite these facts, we shall work
with Wald statistics for two reasons. First and foremost, Wald statistics lead
naturally to the consistency arguments, as will be seen from the proofs below. Other
forms of test statistics do not have this advantage. Second, under estimation methods
other than least squares (e.g., instrumental variable estimation or GMM estimation),
the results of Proposition 1 may not hold. In particular, for nonlinear models, the
equivalence of Wald, LR and LM statistics no longer holds, but our analysis, which is
based on Wald statistics, has the potential to be extended to other estimators,
to nonlinear models, and to simultaneous equations systems. For example, Lo
and Newey (1985) and Andrews and Fair (1989) consider structural change problems
based on Wald-type test statistics for linear and nonlinear simultaneous equations.

In the case of pure structural change, that is, $x_t = z_t$, the W-estimator becomes

$$\hat k = \arg\max_k \left\{ (\hat\beta_1 - \hat\beta_2)' \left[ (X_1'X_1)^{-1} + (X_2'X_2)^{-1} \right]^{-1} (\hat\beta_1 - \hat\beta_2) \right\},$$

where $\hat\beta_1 = (X_1'X_1)^{-1}X_1'y$ and $\hat\beta_2 = (X_2'X_2)^{-1}X_2'y$. The term inside the braces
represents a weighted squared distance between the pre-shift and
post-shift parameter estimators. The W-estimator maximizes this distance.
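
For the pure change case, the distance form can be compared directly with the reduction in the sum of squared residuals obtained by splitting the sample. The following sketch uses simulated data; the design, the parameter values, and the helper name `ols` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
T, k0, p = 60, 35, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=T)
y[k0:] += X[k0:] @ np.array([2.0, -1.0])        # all coefficients shift after k0

def ols(Xs, ys):
    """Least squares coefficients and sum of squared residuals."""
    b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    e = ys - Xs @ b
    return b, e @ e

_, S_restricted = ols(X, y)                     # SSR with no shift allowed
ks = np.arange(p, T - p + 1)
dist = np.empty(len(ks))                        # weighted squared distance
gain = np.empty(len(ks))                        # S - S(k)
for i, k in enumerate(ks):
    b1, ssr1 = ols(X[:k], y[:k])                # pre-shift fit
    b2, ssr2 = ols(X[k:], y[k:])                # post-shift fit
    G1 = np.linalg.inv(X[:k].T @ X[:k])
    G2 = np.linalg.inv(X[k:].T @ X[k:])
    diff = b1 - b2
    dist[i] = diff @ np.linalg.solve(G1 + G2, diff)
    gain[i] = S_restricted - (ssr1 + ssr2)

k_hat = int(ks[np.argmax(dist)])
print(k_hat, np.allclose(dist, gain))
```

The agreement between `dist` and `gain` is the familiar Chow-type identity, which is why maximizing the distance and minimizing the split-sample sum of squared residuals select the same point.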

In addition to the shift point, we are also interested in the regression parameters.
Let $\hat\beta = \hat\beta(\hat k)$ and $\hat\delta = \hat\delta(\hat k)$ be the estimators of $\beta$ and $\delta$ corresponding to $\hat k$. That
is, we replace $Z_0$ by $Z_2$ with $k = \hat k$ and then estimate model (3). We shall establish
subsequently that $\hat\beta$ and $\hat\delta$ are root-T consistent, asymptotically normal, and have
the same limiting distribution as if the shift point $k_0$ were known.

In what follows, we use $o_p(1)$ to denote a sequence of random variables converging
to zero in probability and $O_p(1)$ to denote a sequence that is stochastically bounded.
For a sequence of matrices $B_T$, we write $B_T = O_p(1)$ if each of its elements is $O_p(1)$.
The notation $\|\cdot\|$ is used to denote the Euclidean norm, i.e., $\|x\| = (\sum_{i=1}^{p} x_i^2)^{1/2}$
for $x \in R^p$. For a matrix $A$, we use $\|A\|$ to denote the vector-induced norm, i.e.,
$\|A\| = \sup_{x \neq 0} \|Ax\|/\|x\|$.

We  make  the  following  assumptions: 

A1. $k_0 = [\tau T]$, where $\tau \in (0, 1)$ and $[\cdot]$ is the greatest integer function.

A2. The data $\{(y_{tT}, x_{tT}, z_{tT});\ 1 \le t \le T,\ T \ge 1\}$ form a triangular array. For
notational simplicity, the subscript $T$ will be suppressed. In addition, $z_t = R'x_t$,
where $R$ is $p \times q$, rank$(R) = q$, $z_t \in R^q$, $x_t \in R^p$, $q \le p$.


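A3. The matrix $\sum_{t=i}^{j} x_t x_t'$ is positive definite for large values of $|i - j|$, and
$\sup_{j \ge 1} \left\| \frac{1}{j} \sum_{t=l}^{l+j} x_t x_t' \right\|$ is stochastically bounded for every fixed $l$. Furthermore,
$\left\| \frac{1}{j} \sum_{t=l}^{l+j} x_t x_t' \right\|$ is bounded away from zero for large $j$.

A4. $\frac{1}{T}(X'X) \to Q_{xx}$, where $Q_{xx}$ is finite and positive definite.

A5. $\frac{1}{\sqrt{T}} \sum_{t=1}^{[Ts]} x_t \varepsilon_t \Rightarrow B(s)$, where $B(s)$ is a multivariate Gaussian process with
zero mean and covariance matrix

$$E\{B(s_1)B(s_2)'\} = \Omega(s_1 \wedge s_2),$$

where

$$\Omega(s) = \lim_{T \to \infty} \frac{1}{T} \sum_{i=1}^{[Ts]} \sum_{j=1}^{[Ts]} E(x_i x_j' \varepsilon_i \varepsilon_j).$$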


The notation "$\Rightarrow$" stands for weak convergence in $D[0, 1]$ with the Skorohod
topology (see Pollard (1984)), and "$x \wedge y$" stands for the minimum of $x$ and $y$.

A6. For some real number $r > 2$ and constant $C > 0$,

$$E \left\| \sum_{t=i}^{j} x_t \varepsilon_t \right\|^r \le C (j - i)^{r/2} \qquad \text{for all } 1 \le i < j \le T.$$


Assumption A1 assumes that the shift point is bounded away from the end points.
This assumption can be slightly more general, for example, $k_0 = [\tau_T T]$, where $\tau_T \to \tau$
and $\tau_T, \tau \in (0, 1)$. Assumption A2 allows for trending regressors written in the
form $(t/T)^{\ell}$ ($\ell > 0$) or, more generally, written as any function of the time trend,
$g(t/T)$. Expressing trending regressors in this way avoids a scaling matrix that
would otherwise be required when deriving limiting distributions. The first half of
A3 requires that a sum involving an increasing number of observations must become
positive definite. The latter half is typically implied by the strong law of large
numbers. Assumptions A4 and A5 are standard for linear regressions. Note that
independence of the errors is not assumed. A5 is satisfied for a wide range of settings;
see, e.g., Wooldridge and White (1988). Assumption A6 is satisfied for $x_t \varepsilon_t$ that are
martingales or strongly mixing sequences with some moment conditions; see, e.g.,
Yokoyama (1980) and Andrews and Pollard (1990).

Define $\hat\tau = \hat k/T$. We shall show that $\hat\tau$ is consistent for $\tau$ and prove that $T(\hat\tau - \tau) =
O_p(1)$ in the following section.

3     The Consistency of $\hat\tau$

In this section, we first establish the consistency of the W-estimator and then
obtain its convergence rate. The latter result is also used for obtaining the limiting
distribution  of  the  W-estimator  as  well  as  the  limiting  distribution  of  the  estimated 
regression  parameters.  First,  the  consistency  result: 

Proposition 2  Under assumptions A1-A5, for any $\epsilon > 0$ and $\eta > 0$, there exists
$T_0 > 0$ such that, when $T > T_0$,

$$P(|\hat\tau - \tau| > \eta) < \epsilon.$$

To prove this proposition, we define

$$V_T(k) = \hat\delta_k'(Z_2'MZ_2)\hat\delta_k.$$

By Proposition 1, $\hat k = \arg\max_k V_T(k)$. To obtain consistency, we examine the global
behavior of $V_T(k)$. We shall show that, if $\delta \neq 0$, then with high probability, $V_T(k)$
can only be maximized near $k_0$. The basic idea is to decompose $V_T(k) - V_T(k_0)$ into
two parts: a "deterministic" part and a stochastic part. The deterministic part is
maximized near $k_0$, whereas the stochastic part is uniformly small (in $k$) relative to
the deterministic part. To see this, we first notice that

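$$\hat\delta_k = (Z_2'MZ_2)^{-1}(Z_2'MZ_0)\delta + (Z_2'MZ_2)^{-1}Z_2'M\varepsilon,$$

$$\hat\delta_{k_0} = \delta + (Z_0'MZ_0)^{-1}Z_0'M\varepsilon,$$

therefore

$$\begin{aligned}
V_T(k) - V_T(k_0) &= \hat\delta_k'(Z_2'MZ_2)\hat\delta_k - \hat\delta_{k_0}'(Z_0'MZ_0)\hat\delta_{k_0} && (6) \\
&= \delta'\{(Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) - (Z_0'MZ_0)\}\delta && (7) \\
&\quad + h(k, \delta, \varepsilon), && (8)
\end{aligned}$$

where

$$\begin{aligned}
h(k, \delta, \varepsilon) &= 2\delta'(Z_0'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon - 2\delta'Z_0'M\varepsilon && (9) \\
&\quad + \varepsilon'MZ_2(Z_2'MZ_2)^{-1}Z_2'M\varepsilon - \varepsilon'MZ_0(Z_0'MZ_0)^{-1}Z_0'M\varepsilon. && (10)
\end{aligned}$$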

Expression (7) constitutes the deterministic part and $h(k, \delta, \varepsilon)$ constitutes the
stochastic part. Denote

$$X_\Delta = X_2 - X_0 = (0, \ldots, 0, x_{k+1}, \ldots, x_{k_0}, 0, \ldots, 0)' \qquad \text{for } k < k_0,$$

$$X_\Delta = -(X_2 - X_0) = (0, \ldots, 0, x_{k_0+1}, \ldots, x_k, 0, \ldots, 0)' \qquad \text{for } k > k_0,$$

and define $X_\Delta$ to be a zero matrix if $k = k_0$, so that $X_2 = X_0 + X_\Delta\,\mathrm{sign}(k_0 - k)$. When
the distinction between the positive and negative signs is immaterial, we simply write
$X_2 = X_0 \pm X_\Delta$. Set $Z_\Delta = X_\Delta R$, and let

$$\gamma(k) = \frac{\delta'\{(Z_0'MZ_0) - (Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0)\}\delta}{|k_0 - k|}. \tag{11}$$

When $k = k_0$, both the numerator and denominator of $\gamma(k)$ are zero. In that case
we arbitrarily define $\gamma(k_0) = \delta'\delta$. Note that $\gamma(k)$ is non-negative because the matrix
inside the braces is positive semi-definite. We have the following identity:

$$V_T(k) - V_T(k_0) = -|k_0 - k|\,\gamma(k) + h(k, \delta, \varepsilon) \qquad \text{for all } k. \tag{12}$$
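
Identity (12) is purely algebraic, so it can be verified on any data set. The sketch below does so for a simulated partial change model in which only the slope shifts; the design, parameter values, and the helper names `Z_after`, `V`, and `gamma_and_h` are assumptions introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, k0, p = 50, 20, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
R = np.array([[0.0], [1.0]])                    # only the slope shifts (q = 1)
beta, delta = np.array([1.0, 0.5]), np.array([1.5])
eps = rng.normal(size=T)

def Z_after(k):
    """Z_2 for split k: rows t <= k are zero, rows t > k equal x_t'R."""
    Xk = X.copy()
    Xk[:k] = 0.0
    return Xk @ R

M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
Z0 = Z_after(k0)
y = X @ beta + Z0 @ delta + eps

def V(k):
    """V_T(k) = delta_hat_k' (Z_2'MZ_2) delta_hat_k."""
    Z2 = Z_after(k)
    A = Z2.T @ M @ Z2
    d = np.linalg.solve(A, Z2.T @ M @ y)
    return d @ A @ d

def gamma_and_h(k):
    """gamma(k) from (11) and h(k, delta, eps) from (9)-(10)."""
    Z2 = Z_after(k)
    A2, A0 = Z2.T @ M @ Z2, Z0.T @ M @ Z0
    C = Z0.T @ M @ Z2
    P2e = np.linalg.solve(A2, Z2.T @ M @ eps)
    P0e = np.linalg.solve(A0, Z0.T @ M @ eps)
    gam = delta @ (A0 - C @ np.linalg.solve(A2, C.T)) @ delta / abs(k0 - k)
    h = (2 * delta @ C @ P2e - 2 * delta @ Z0.T @ M @ eps
         + eps @ M @ Z2 @ P2e - eps @ M @ Z0 @ P0e)
    return gam, h

ks = [k for k in range(p, T - p + 1) if k != k0]
lhs = np.array([V(k) - V(k0) for k in ks])
parts = [gamma_and_h(k) for k in ks]
rhs = np.array([-abs(k0 - k) * g + h for k, (g, h) in zip(ks, parts)])
min_gamma = min(g for g, _ in parts)
print(np.allclose(lhs, rhs), min_gamma >= -1e-10)
```

In this decomposition the first term is non-positive for every k, which is why the argument that follows only needs to control the size of h relative to |k_0 - k| times gamma(k).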

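The shift-point estimator $\hat k$ must satisfy $V_T(\hat k) \ge V_T(k_0)$, or equivalently, $h(\hat k, \delta, \varepsilon) \ge
|k_0 - \hat k|\,\gamma(\hat k)$. Thus we have

$$\begin{aligned}
P(|\hat\tau - \tau| > \eta) &= P(|\hat k - k_0| > T\eta) \\
&\le P\Big( \sup_{|k - k_0| > T\eta} |h(k, \delta, \varepsilon)| \ge \inf_{|k - k_0| > T\eta} |k_0 - k|\,\gamma(k) \Big) \\
&\le P\Big( \sup_{p \le k \le T-p} |h(k, \delta, \varepsilon)| \ge T\eta \inf_{|k_0 - k| > T\eta} \gamma(k) \Big) \\
&= P\Big( \gamma_T^{-1} \sup_{p \le k \le T-p} T^{-1} |h(k, \delta, \varepsilon)| \ge \eta \Big),
\end{aligned}$$

where

$$\gamma_T = \inf_{|k - k_0| > T\eta} \gamma(k) \ge 0.$$

Lemma A.2 shows that $\gamma_T$ is positive and bounded away from zero. Thus consistency
will follow from

$$T^{-1} \sup_{p \le k \le T-p} |h(k, \delta, \varepsilon)| = o_p(1). \tag{13}$$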

Next we verify (13). In fact, we shall show that the left-hand side of (13) is $O_p(T^{-1/2} \log T)$, a stronger
result than needed. For each fixed $k$, it is not difficult to see that $h(k, \delta, \varepsilon)$ grows
at most at the rate of $\sqrt{T}$. It follows that $h$ divided by $T$ converges to zero in
probability. However, this is not enough. We must prove that the maximum value
of $h$ taken over all possible $k$ grows at a slower rate than $T$. This is indeed the case.
Here is the precise argument. Divided by $T$, the first term on the right-hand side of (9) can be
written as

$$2\delta' \frac{1}{\sqrt{T}} \frac{1}{\sqrt{T}} (Z_0'MZ_2)(Z_2'MZ_2)^{-1/2}(Z_2'MZ_2)^{-1/2}Z_2'M\varepsilon. \tag{14}$$

The law of the iterated logarithm implies that

$$\sup_k \|(Z_2'MZ_2)^{-1/2}Z_2'M\varepsilon\| = O_p(\log T), \tag{15}$$

where the supremum is taken over all values of $k$ such that $p \le k \le T - p$. Next we
show $B_T = T^{-1/2}(Z_0'MZ_2)(Z_2'MZ_2)^{-1/2} = O_p(1)$ uniformly in $k$, or equivalently,

$$\sup_k B_TB_T' = \sup_k T^{-1}(Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) = O_p(1).$$

But the above follows from $(Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) \le (Z_0'MZ_0)$ for all $k$ and
$T^{-1}(Z_0'MZ_0) = O_p(1)$. Thus (14) is $O_p(T^{-1/2} \log T)$. The second term of (9) divided
by $T$ is $O_p(T^{-1/2})$ because $T^{-1}Z_0'M\varepsilon = O_p(T^{-1/2})$. From (15), $\sup_k T^{-1}\varepsilon'MZ_2(Z_2'MZ_2)^{-1}Z_2'M\varepsilon =
O_p(T^{-1}(\log T)^2)$, which corresponds to the first term of (10). The last term of (10)
divided by $T$ is obviously $O_p(T^{-1})$. Combining these results, we have $T^{-1} \sup_k |h(k, \delta, \varepsilon)| =
O_p(T^{-1/2} \log T)$. We thus establish the consistency of the W-estimator.

It should be pointed out that the consistency is obtained by globally searching the
Wald-type statistics. The consistency argument does not require that the search
be limited to a positive fraction of the data points such that $k \in [\epsilon T, (1-\epsilon)T]$.
The implication is that the sequence of Wald-type statistics is well behaved even at
the two ends, attaining its global maximum near the true shift point. In a recent
paper, Nunes, Kuan, and Newbold (1993) prove consistency but without
obtaining any rate of convergence. Their arguments also require restricted searching.

Next we establish the T-consistency result:

Proposition 3  Under assumptions A1-A6, for any $\epsilon > 0$, there exists a $C > 0$
such that for large $T$,

$$P(|T(\hat\tau - \tau)| > C) = P(|\hat k - k_0| > C) < \epsilon.$$

Again, because $V_T(\hat k) \ge V_T(k_0)$ by definition, it suffices to show

$$P\Big( \sup_{|k - k_0| > C} V_T(k) \ge V_T(k_0) \Big) < \epsilon. \tag{16}$$

By Proposition 2, for any $\epsilon > 0$ and $a > 0$, we have $P(|\hat k - k_0| > Ta) < \epsilon$ for large
$T$. Thus to prove (16), it is sufficient to show that

$$P_1 = P\Big( \sup_{k \in K(C)} V_T(k) \ge V_T(k_0) \Big) < \epsilon,$$

where $K(C) = \{k : |k - k_0| > C \text{ and } Ta \le k \le (1-a)T\}$ for a small number $a > 0$.
Finding the maximum value over the set $K(C)$ amounts to restricted searching,
but this is legitimate only after establishing consistency. Note that, by (12), $V_T(k) \ge V_T(k_0)$ is
equivalent to

$$\frac{h(k, \delta, \varepsilon)}{|k_0 - k|} \ge \gamma(k).$$

Thus

$$P_1 \le P\Big( \sup_{k \in K(C)} \Big| \frac{h(k, \delta, \varepsilon)}{k_0 - k} \Big| \ge \inf_{k \in K(C)} \gamma(k) \Big).$$

By Lemma A.2, $\inf_{k \in K(C)} \gamma(k)$ is bounded away from 0 for large $C$
and large $T$. Thus it suffices to show that for any fixed $A > 0$,

$$P_2 = P\Big( \sup_{k \in K(C)} \Big| \frac{h(k, \delta, \varepsilon)}{k_0 - k} \Big| > A \Big) < \epsilon \tag{17}$$


when $C$ is large. Consider the terms in (10). For $k \in K(C)$, $Z_2$ involves a
positive fraction of data (at least $Ta$ observations), thus $(Z_2'MZ_2/T)^{-1} = O_p(1)$ by
assumptions A2-A4 and $T^{-1/2}Z_2'M\varepsilon = O_p(1)$ by assumption A5, where $O_p(1)$ is
uniform in $k$. Thus $\varepsilon'MZ_2(Z_2'MZ_2)^{-1}Z_2'M\varepsilon = O_p(1)$ uniformly on $K(C)$. Similarly,
$\varepsilon'MZ_0(Z_0'MZ_0)^{-1}Z_0'M\varepsilon = O_p(1)$. Therefore (10) divided by $|k_0 - k|$ is bounded by
$O_p(1)/|k_0 - k| \le O_p(1)/C$ because $|k_0 - k| > C$. Choose $C$ large enough so that
$P(O_p(1)/C > A) < \epsilon/3$ for any pre-given $A > 0$.

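Next consider (9). Use $Z_2 = Z_0 \pm Z_\Delta$ to deduce that

$$\begin{aligned}
\delta'(Z_0'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon
&= \delta'Z_2'M\varepsilon \pm \delta'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon \\
&= \delta'Z_0'M\varepsilon \pm \delta'Z_\Delta'M\varepsilon \pm \delta'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon. && (18)
\end{aligned}$$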


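Therefore,

$$\begin{aligned}
&\left| \frac{\delta'(Z_0'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon - \delta'Z_0'M\varepsilon}{k_0 - k} \right| && (19) \\
&\qquad \le \frac{|\delta'Z_\Delta'M\varepsilon| + |\delta'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon|}{|k_0 - k|} && (20) \\
&\qquad = \left| \frac{\delta'Z_\Delta'M\varepsilon}{k_0 - k} \right| + \left| \frac{\delta'(Z_\Delta'MZ_2)}{k_0 - k} \left( \frac{Z_2'MZ_2}{T} \right)^{-1} \left( \frac{Z_2'M\varepsilon}{T} \right) \right|. && (21)
\end{aligned}$$

Now by assumptions A3 and A4,

$$\frac{\delta'(Z_\Delta'MZ_2)}{k_0 - k} = \delta' \left[ \frac{Z_\Delta'Z_2}{k_0 - k} - \frac{Z_\Delta'X}{k_0 - k} \left( \frac{X'X}{T} \right)^{-1} \frac{X'Z_2}{T} \right] = O_p(1),$$

which together with the functional central limit theorem implies that the second
term of (21) is $O_p(T^{-1/2})$. Next, $Z_\Delta'M\varepsilon = Z_\Delta'\varepsilon - Z_\Delta'X(X'X)^{-1}X'\varepsilon$. Suppose $k < k_0$
(the case of $k > k_0$ is similar), then

$$\begin{aligned}
\frac{1}{k_0 - k} Z_\Delta'M\varepsilon
&= \frac{1}{k_0 - k} \sum_{t=k+1}^{k_0} z_t\varepsilon_t - \frac{1}{k_0 - k} \Big( \sum_{t=k+1}^{k_0} z_tx_t' \Big) (X'X/T)^{-1}(X'\varepsilon/T) \\
&= \frac{1}{k_0 - k} \sum_{t=k+1}^{k_0} z_t\varepsilon_t + O_p(T^{-1/2}).
\end{aligned}$$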


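It remains to be shown that for a given $A$, there exists a $C > 0$ such that

$$P\Big( \sup_{k \le k_0 - C} \Big\| \frac{1}{k_0 - k} \sum_{t=k+1}^{k_0} z_t\varepsilon_t \Big\| > A \Big) < \epsilon.$$

By Lemma A.3, there exists an $L > 0$ such that for any $A > 0$ and $C > 0$,

$$P\Big( \sup_{k \le k_0 - C} \Big\| \frac{1}{k_0 - k} \sum_{t=k+1}^{k_0} z_t\varepsilon_t \Big\| > A \Big) \le \frac{L}{A^r C^{r/2}},$$

which is less than $\epsilon$ when $C$ is large for a given $A$. This proves (17) and therefore
Proposition 3.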

T-consistency seems to be typical for shift-point estimators, just as root-T
consistency is typical for other parameters under regularity conditions. T-consistency is
obtained under nonparametric estimation in the case of two i.i.d. samples; see
Duembgen (1991). The same rate of convergence is also established for LAD estimation
and least squares estimation in a strongly mixing sequence or a linear process with
a shift in mean; see Bai (1993a, b).
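
A small Monte Carlo experiment can illustrate what T-consistency means in practice: the error of the estimated shift point, measured in number of observations, does not grow with the sample size. The sketch below uses the simplest special case, a shift in mean with i.i.d. standard normal errors (p = q = 1); the shift size, number of replications, seed, and function names are assumptions made for illustration.

```python
import numpy as np

def k_hat_mean_shift(y):
    """Least squares shift-point estimate for y_t = mu + delta*1(t > k) + e_t."""
    T = len(y)
    k = np.arange(1, T)                         # candidate splits 1, ..., T-1
    csum = np.cumsum(y)
    mean1 = csum[:-1] / k                       # mean of y_1, ..., y_k
    mean2 = (csum[-1] - csum[:-1]) / (T - k)    # mean of y_{k+1}, ..., y_T
    gain = k * (T - k) / T * (mean1 - mean2) ** 2   # S - S(k)
    return int(k[np.argmax(gain)])

def median_abs_error(T, tau=0.5, delta=2.0, reps=200, seed=0):
    """Median of |k_hat - k0| over simulated samples of length T."""
    rng = np.random.default_rng(seed)
    k0 = int(tau * T)
    errs = []
    for _ in range(reps):
        y = rng.normal(size=T)
        y[k0:] += delta                         # mean shift after k0
        errs.append(abs(k_hat_mean_shift(y) - k0))
    return float(np.median(errs))

errors = {T: median_abs_error(T) for T in (100, 400, 1600)}
print(errors)
```

If the estimator were only consistent for the fraction tau at a slower rate, the error in observations would increase with T; under T-consistency it remains stochastically bounded, which is the pattern one would expect to see in `errors`.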

Given the convergence rate of $\hat k$, it is relatively easy to show that the estimators of
$\beta$ and $\delta$ are not only root-T consistent, but asymptotically normal. In estimating the
regression parameters, we use $\hat k$ in place of $k_0$. Because of the fast rate of convergence,
treating $\hat k$ as $k_0$ does not affect the limiting distribution of the estimated regression
parameters. Recall that $\hat\beta = \hat\beta(\hat k)$ and $\hat\delta = \hat\delta(\hat k)$. We have

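Corollary 1    Under assumptions A1-A6, together with $\varepsilon_t$ being uncorrelated,

$$\sqrt{T} \begin{pmatrix} \hat\beta - \beta \\ \hat\delta - \delta \end{pmatrix} \xrightarrow{d} N(0, \sigma^2 V^{-1}),$$

where

$$V = \operatorname{plim} \frac{1}{T} \begin{pmatrix} \sum_{t=1}^{T} x_tx_t' & \sum_{t=k_0+1}^{T} x_tz_t' \\ \sum_{t=k_0+1}^{T} z_tx_t' & \sum_{t=k_0+1}^{T} z_tz_t' \end{pmatrix}. \tag{22}$$

For serially correlated disturbances, the variance-covariance matrix of the limiting
distribution is given by

$$V^{-1}UV^{-1},$$

where

$$U = \lim \frac{1}{T} \begin{pmatrix} \sum_{i, j \ge 1} E(x_ix_j'\varepsilon_i\varepsilon_j) & \sum_{i \ge 1,\, j > k_0} E(x_iz_j'\varepsilon_i\varepsilon_j) \\ \sum_{i > k_0,\, j \ge 1} E(z_ix_j'\varepsilon_i\varepsilon_j) & \sum_{i, j > k_0} E(z_iz_j'\varepsilon_i\varepsilon_j) \end{pmatrix}. \tag{23}$$
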

Methods developed by Newey and West (1987) and Andrews (1991) can be used to
estimate the matrix $U$. When estimating $V$ and $U$, one uses $\hat k$ in place of $k_0$. Notice
that $\hat\beta$ and $\hat\delta$ have the same limiting distributions as if $k_0$ were known. The
conclusion here is that, although the estimators of the regression coefficients are determined
sequentially, confidence intervals for $\beta$ and $\delta$ can be constructed in the conventional
way (based on t-statistics). This conclusion assumes that a shift does in fact exist.
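
The practical upshot can be sketched as a two-step procedure: estimate the shift point by minimizing the sum of squared residuals, then treat the estimated shift point as known and apply conventional least squares inference. The example below is an illustration under the assumption of uncorrelated, homoskedastic errors; the simulated design, parameter values, and the helper name `fit` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
T, k0 = 400, 160
X = np.column_stack([np.ones(T), rng.normal(size=T)])
R = np.eye(2)                                   # pure change: z_t = x_t
beta, delta = np.array([1.0, 0.5]), np.array([2.0, -1.0])
X0 = X.copy()
X0[:k0] = 0.0
y = X @ beta + (X0 @ R) @ delta + rng.normal(size=T)

def fit(k):
    """Regress y on X and Z_2 for split k; return design, coefficients, SSR."""
    X2 = X.copy()
    X2[:k] = 0.0
    D = np.hstack([X, X2 @ R])
    coef = np.linalg.lstsq(D, y, rcond=None)[0]
    resid = y - D @ coef
    return D, coef, resid @ resid

p, q = X.shape[1], R.shape[1]
ks = np.arange(p, T - p + 1)
k_hat = int(ks[np.argmin([fit(k)[2] for k in ks])])   # step 1: shift point

D, coef, ssr = fit(k_hat)                       # step 2: treat k_hat as known
sigma2 = ssr / (T - p - q)                      # error variance estimate
cov = sigma2 * np.linalg.inv(D.T @ D)           # conventional OLS covariance
se = np.sqrt(np.diag(cov))
ci = np.column_stack([coef - 1.96 * se, coef + 1.96 * se])   # approx. 95% intervals
print(k_hat)
print(ci)
```

With serially correlated disturbances, the covariance line would need to be replaced by a sandwich estimate of the form V^{-1}UV^{-1}, with U estimated by a heteroskedasticity and autocorrelation consistent method.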

We required that the vector $z_t$ be a sub-vector of $x_t$ or a linear transformation
of $x_t$. It may be possible to have a regressor in $z_t$ but not in $x_t$. That is, a new
variable enters into play only after a structural change. The current proof does
not allow this scenario. An extension to cover this case requires results similar to
Lemma A.1 and Lemma A.2 without assuming $z_t = R'x_t$. One may explore this
possibility by using the identity in Lemma A.4. Another solution is to include the
new variable in the matrix $M$, which is equivalent to assigning a zero coefficient to
the new variable and including it in $x_t$. But this method produces an inefficient
estimation of the regression parameters and consequently an inefficient estimation of the
shift point as well.


4     Asymptotic  Distribution  and  Confidence  Sets 

We  now  consider  the  limiting  distribution  of  the  W-estimator  for  small  shifts  (shifts 
of  shrinking  sizes  as  T  increases).  There  are  two  reasons  for  this.  One  is  the  technical 
intractability  for  shifts  of  fixed  magnitude.  The  limiting  distribution  is  highly  data 
dependent  for  fixed  shifts  and  thus  difficult  to  obtain.  The  result  of  Hinkley  (1970) 
indicates  that,  even  for  an  i.i.d.  normal  sample  with  a  mean  shift,  the  limiting 
distribution  is  enormously  complicated.  Second,  because  of  this  data  dependence, 
the  limiting  distribution  is  of  less  practical  importance.  The  approach  we  take  here 
is similar to that of Picard (1985) and Yao (1987). We find a limiting distribution for
small changes. This limiting distribution can then be used as an approximation to
the  underlying  distribution  for  moderate  shifts.  Of  course,  the  approximation  may 
not  be  satisfactory  for  large  shifts.  Nevertheless,  the  limiting  distribution  provides 
insight  on  how  serial  correlation  affects  the  precision  of  the  shift  point  estimator  and 
in  what  way  the  magnitude  of  the  shift  and  regressors  influence  the  precision  of  the 
shift  point. 

We assume the magnitude of shifts, $\|\delta_T\|$, depends on $T$ such that

\[ \|\delta_T\| \to 0, \qquad \sqrt{T}\,\|\delta_T\| \to \infty. \tag{24} \]

In the previous section, we treated $\delta$ as a constant not varying with $T$ and established:

\[ \hat{k} = k_0 + O_p(1). \]

With some modifications, we can show

\[ \hat{k} = k_0 + O_p(\|\delta_T\|^{-2}). \tag{25} \]

We omit the details here. Similar results may be found in Bai (1993a,b). Given the
convergence rate (25), the limiting distribution of $\hat{k}$ may be obtained by the local
weak convergence of $V_T(k) - V_T(k_0)$ in conjunction with the continuous mapping
theorem for the argmax functional. By local weak convergence we mean
weak convergence when $k$ varies in a neighborhood of $k_0$ such that $k = k_0 + [v\lambda_T^{-2}]$,
where $\lambda_T^2 = O(\|\delta_T\|^2)$ and $v$ is a real number in a compact set. Let

\[ K_T(V) = \{k : k = k_0 + [v\lambda_T^{-2}],\ |v| \le V\}. \]

For any given $V > 0$, we derive the limiting process of

\[ V_T(k_0 + [v\lambda_T^{-2}]) - V_T(k_0) \]

for $v \in [-V, V]$. If $V_T(k_0 + [v\lambda_T^{-2}]) - V_T(k_0) \Rightarrow G(v)$, then the continuous mapping
theorem implies $\lambda_T^2(\hat{k} - k_0) \xrightarrow{d} \operatorname{argmax}_v G(v)$. For a discussion of the continuity
of an argmax functional, we refer to Kim and Pollard (1991).

On $K_T(V)$, we have a simpler expression for $V_T(k) - V_T(k_0)$.

Proposition 4 Under assumptions A1-A6,

\[ V_T(k) - V_T(k_0) = -\delta_T' Z_\Delta' Z_\Delta \delta_T \pm 2\delta_T' Z_\Delta' \varepsilon + o_p(1), \]

where $o_p(1)$ is uniform on $K_T(V)$.

The proof is provided in the Appendix. Thus the limiting process of $V_T(k) - V_T(k_0)$
is determined by $-\delta_T' Z_\Delta' Z_\Delta \delta_T \pm 2\delta_T' Z_\Delta' \varepsilon$. We consider two leading cases.

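Case 1: Nontrending regressors but possibly correlated errors. Assume that the
regressors satisfy

\[ \operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=1}^{[Ts]} z_t z_t' = sQ \qquad\text{and}\qquad \operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{i=1}^{T}\sum_{j=1}^{T} E(z_i z_j' \varepsilon_i \varepsilon_j) = \Omega \]

for all $s$. Let $\lambda_T^2 = \delta_T' Q \delta_T$. Consider the limit of $\delta_T' Z_\Delta' Z_\Delta \delta_T$ for $k = k_0 + [v\lambda_T^{-2}]$. Now

\[ \delta_T' Z_\Delta' Z_\Delta \delta_T = \lambda_T^{-2}\,\delta_T' \left( \frac{Z_\Delta' Z_\Delta}{\lambda_T^{-2}} \right) \delta_T . \]

Since $Z_\Delta' Z_\Delta$ involves $|[v\lambda_T^{-2}]|$ (the absolute value of $[v\lambda_T^{-2}]$) observations of $z_t$, we
have

\[ \frac{Z_\Delta' Z_\Delta}{\lambda_T^{-2}} \to |v|\,Q \]

uniformly in $v \in [-V, V]$ (note that $\lambda_T^{-2} \to \infty$). This implies

\[ \delta_T' Z_\Delta' Z_\Delta \delta_T \to |v| \]

uniformly in $v \in [-V, V]$.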

Next consider the limit of $\delta_T' Z_\Delta' \varepsilon$. Assume $v \le 0$, that is, $k = k_0 + [v\lambda_T^{-2}] \le k_0$.
By the functional central limit theorem of A5 (rescaled),

\[ \delta_T' Z_\Delta' \varepsilon = \delta_T' \sum_{t=k+1}^{k_0} z_t \varepsilon_t \Rightarrow \phi W_1(-v), \]

where $W_1(\cdot)$ is a Brownian motion process on $[0, \infty)$ with $W_1(0) = 0$ and
$\phi^2$ is the limit of $\delta_T' \Omega \delta_T / (\delta_T' Q \delta_T)$, the scaling under which (26) holds.
The weak convergence is in the space $D[-V, 0]$, the set of cadlag functions defined
on $[-V, 0]$, endowed with the Skorohod topology [see Pollard (1984)]. Similarly, when
$k > k_0$ (i.e., $v > 0$), we have

\[ \delta_T' Z_\Delta' \varepsilon = \delta_T' \sum_{t=k_0+1}^{k} z_t \varepsilon_t \Rightarrow \phi W_2(v), \]

where $W_2(\cdot)$ is another Brownian motion process on $[0, \infty)$ with $W_2(0) = 0$. The two
processes $W_1$ and $W_2$ are independent because they are the limits of non-overlapping
disturbances that are only weakly dependent. Let $W(v) = W_1(-v)$ for $v < 0$,
$W(0) = 0$, and $W(v) = W_2(v)$ for $v > 0$, a two-sided Brownian motion process on
$(-\infty, +\infty)$. Then combining these results, we have

\[ V_T(k_0 + [v\lambda_T^{-2}]) - V_T(k_0) \Rightarrow -|v| + 2\phi W(v). \]

From the continuous mapping theorem for the argmax functional,

\[ \lambda_T^2(\hat{k} - k_0) \xrightarrow{d} \operatorname{argmax}_v\{-|v| + 2\phi W(v)\} = \phi^2 \operatorname{argmax}_v\{W(v) - |v|/2\}. \]

The last equality follows from a change in variable. Denote the random variable
$\operatorname{argmax}_v\{W(v) - |v|/2\}$ by $V^*$; its distribution function is given in (35). In view of the
definitions of $\lambda_T^2$ and $\phi^2$, we can write

\[ (\delta_T' Q \delta_T)^2 (\delta_T' \Omega \delta_T)^{-1} (\hat{k} - k_0) \xrightarrow{d} V^*. \tag{26} \]
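
The change of variable behind the last equality in the argmax display can be spelled out as follows. This is only a sketch that relies on the scaling property of Brownian motion; the process $\widetilde{W}$ is notation introduced here for illustration and does not appear in the original argument. Put $v = \phi^2 u$ and $\widetilde{W}(u) = \phi^{-1} W(\phi^2 u)$, which is again a two-sided Brownian motion.

```latex
\begin{aligned}
-|v| + 2\phi W(v)
  &= -\phi^2 |u| + 2\phi \cdot \phi\, \widetilde{W}(u) \\
  &= 2\phi^2 \left\{ \widetilde{W}(u) - |u|/2 \right\}, \\
\operatorname{argmax}_v \{ -|v| + 2\phi W(v) \}
  &= \phi^2 \operatorname{argmax}_u \{ \widetilde{W}(u) - |u|/2 \}.
\end{aligned}
```

Because $\widetilde{W}$ has the same law as $W$, the right-hand side is distributed as $\phi^2 V^*$, so the equality in the text is best read as an equality in distribution.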

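For uncorrelated errors, because $\Omega = \sigma^2 Q$, (26) becomes

\[ \frac{\delta_T' Q \delta_T}{\sigma^2} (\hat{k} - k_0) \xrightarrow{d} V^*. \tag{27} \]

A special case is a shift in the intercept only. In this situation, $z_t = 1$, so $Q = 1$. It
follows that $\delta_T^2(\hat{k} - k_0) \xrightarrow{d} \sigma^2 V^*$.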
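
The intercept-shift case lends itself to a small simulation. The following is a minimal sketch, not the procedure of the paper: it assumes Gaussian i.i.d. errors and estimates the shift point by a global search over the reduction in the sum of squared residuals, which has the same maximizer as the sequence of F-statistics in this least-squares setting. The function names and default parameter values are hypothetical.

```python
import random


def estimate_mean_shift_point(y):
    """Return k_hat in 1..T-1 that maximizes the reduction in the sum of
    squared residuals from fitting separate means to y[:k] and y[k:]."""
    T = len(y)
    if T < 2:
        raise ValueError("need at least two observations")
    total = sum(y)
    left_sum = 0.0
    best_k, best_val = 1, float("-inf")
    for k in range(1, T):  # global search, no trimming of the sample ends
        left_sum += y[k - 1]
        mean_left = left_sum / k
        mean_right = (total - left_sum) / (T - k)
        # SSR(no break) - SSR(break at k) = k (T - k) / T * (mean difference)^2
        val = k * (T - k) / T * (mean_right - mean_left) ** 2
        if val > best_val:
            best_k, best_val = k, val
    return best_k


def simulate_k_hat(T=400, k0=200, delta=1.0, sigma=1.0, seed=0):
    """Draw y_t = delta * 1(t > k0) + e_t with Gaussian e_t (an assumption
    made for this sketch) and return the estimated shift point."""
    rng = random.Random(seed)
    y = [(delta if t > k0 else 0.0) + rng.gauss(0.0, sigma)
         for t in range(1, T + 1)]
    return estimate_mean_shift_point(y)
```

Repeating `simulate_k_hat` over many seeds for several values of `delta` gives an empirical picture of how the spread of $\hat{k} - k_0$ shrinks as the shift grows, in line with the $\delta_T^2$ scaling above.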

Case 2: Trending regressors. Assume $z_t = g(t/T) = (g_1(t/T), \ldots, g_q(t/T))'$ and
the functions $g_i(x)$ have bounded derivatives on $[0, 1]$. Let $\lambda_T^2 = \delta_T' g(\tau) g(\tau)' \delta_T$, which
is $O(\|\delta_T\|^2)$. Let us first show that

\[ \delta_T' Z_\Delta' Z_\Delta \delta_T \to |v| \]

uniformly in $v \in [-V, V]$. Consider the case of $v \le 0$ (i.e., $k \le k_0$). Then

\begin{align}
\delta_T' Z_\Delta' Z_\Delta \delta_T &= \delta_T' \sum_{t=k+1}^{k_0} g(t/T) g(t/T)' \delta_T \tag{28} \\
&= (k_0 - k)\,\delta_T' g(\tau) g(\tau)' \delta_T \tag{29} \\
&\quad + \delta_T' \sum_{t=k+1}^{k_0} [g(t/T) - g(\tau)][g(t/T) - g(\tau)]' \delta_T \tag{30} \\
&\quad + 2\delta_T' \sum_{t=k+1}^{k_0} [g(t/T) - g(\tau)]\, g(\tau)' \delta_T. \tag{31}
\end{align}

Expression (29) is equal to $(k_0 - k)\lambda_T^2 = -[v\lambda_T^{-2}]\lambda_T^2$, which converges to $-v$ uniformly
in $v \in [-V, 0]$. We next argue that (30) and (31) are $o_p(1)$ uniformly on $K_T(V)$.
Note that (30) is bounded by

\[ \sup_x \left\| \frac{dg(x)}{dx} \right\|^2 \delta_T' \delta_T \sum_{t=k+1}^{k_0} (t - k_0)^2 / T^2 \le B_1 \|\delta_T\|^2 (k_0 - k)^3 / T^2 \le B_2 (T^2 \|\delta_T\|^4)^{-1} \to 0, \tag{32} \]

for some constants $B_1$ and $B_2$. The proof that (31) is also $o_p(1)$ is similar.

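Next

\[ \delta_T' Z_\Delta' \varepsilon = \delta_T' \sum_{t=k+1}^{k_0} g(t/T)\varepsilon_t = \delta_T' g(k_0/T) \sum_{t=k+1}^{k_0} \varepsilon_t + \delta_T' \sum_{t=k+1}^{k_0} [g(t/T) - g(k_0/T)]\varepsilon_t. \tag{33} \]

Suppose the $\varepsilon_t$ are uncorrelated; then the variance of the second term on the
right-hand side is

\[ \sigma^2 \delta_T' \sum_{t=k+1}^{k_0} [g(t/T) - g(k_0/T)][g(t/T) - g(k_0/T)]' \delta_T, \]

which converges to zero by (30) and (32). Thus the limiting distribution of (33) is
determined by $\delta_T' g(\tau) \sum_{t=k+1}^{k_0} \varepsilon_t$. Because $k = k_0 + [v\lambda_T^{-2}]$ and $\lambda_T^2 = \delta_T' g(\tau) g(\tau)' \delta_T$,
by the functional central limit theorem (see A5),

\[ \delta_T' g(\tau) \sum_{t=k+1}^{k_0} \varepsilon_t \Rightarrow \sigma W_1(-v). \]

The argument for $v > 0$ is similar. In particular,

\[ \delta_T' Z_\Delta' \varepsilon \Rightarrow \sigma W_2(v), \]

where $W_2(\cdot)$ is another Brownian motion process independent of $W_1(\cdot)$. Defining $W(v)$
as in the previous case, we have

\[ V_T(k_0 + [v\lambda_T^{-2}]) - V_T(k_0) \Rightarrow -|v| + 2\sigma W(v). \]

This implies, by the continuous mapping theorem and a change in variable,

\[ \frac{\lambda_T^2(\hat{k} - k_0)}{\sigma^2} \xrightarrow{d} \operatorname{argmax}_v\{W(v) - |v|/2\}, \tag{34} \]

where $\lambda_T^2 = \delta_T' g(\tau) g(\tau)' \delta_T$. For stationary and serially correlated errors, the above
convergence still holds but with $\sigma^2$ replaced by

\[ \phi^2 = \lim_{h\to\infty} \frac{1}{h} \sum_{i=1}^{h} \sum_{j=1}^{h} E(\varepsilon_i \varepsilon_j). \]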
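
The quantity $\phi^2$ is the long-run variance of the errors. As an illustration only, the sketch below estimates it from a scalar series of residuals with Bartlett weights, in the spirit of the Newey and West (1987) method that the text cites for estimating $\Omega$. The function name, the demeaning step, and the user-chosen `bandwidth` are assumptions of this sketch rather than part of the paper.

```python
def long_run_variance(e, bandwidth):
    """Bartlett-kernel estimate of lim (1/h) sum_i sum_j E(e_i e_j)
    for a scalar series e, using autocovariances up to lag `bandwidth`."""
    T = len(e)
    if T == 0:
        raise ValueError("empty series")
    if bandwidth < 0 or bandwidth >= T:
        raise ValueError("bandwidth must satisfy 0 <= bandwidth < len(e)")
    mean = sum(e) / T
    u = [x - mean for x in e]  # demeaning is harmless for regression residuals

    def autocov(j):
        # Sample autocovariance at lag j, normalized by T
        return sum(u[t] * u[t - j] for t in range(j, T)) / T

    lrv = autocov(0)
    for j in range(1, bandwidth + 1):
        weight = 1.0 - j / (bandwidth + 1)  # Bartlett weight
        lrv += 2.0 * weight * autocov(j)
    return lrv
```

With `bandwidth=0` the function reduces to the usual variance estimate, which corresponds to the uncorrelated case where $\phi^2 = \sigma^2$.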

Confidence Intervals. To construct confidence intervals for $k_0$, we use the
asymptotic results (26), (27), and (34). Under case 1 (non-trending regressors)
together with uncorrelated errors, the limiting distribution of $\hat{k}$ is given by (27).
The distribution function of the random variable $V^* = \operatorname{argmax}_v\{W(v) - |v|/2\}$ is

\[ G(x) = 1 + (2\pi)^{-1/2}\sqrt{x}\,e^{-x/8} - \tfrac{1}{2}(x + 5)\Phi(-\sqrt{x}/2) + \tfrac{3}{2}e^{x}\Phi(-3\sqrt{x}/2) \tag{35} \]

for $x > 0$ and $G(x) = 1 - G(-x)$ for $x < 0$ [see Yao (1987)], where $\Phi(x)$ is the distribution
function of a standard normal random variable. So quantiles for this random variable
are easy to obtain. All we need are estimates for $\lambda_T^2 = \delta_T' Q \delta_T$ and for the variance
$\sigma^2$. The shift $\delta_T$ is estimable; see Corollary 1. The matrix $Q$ can be simply taken
as $\frac{1}{T}\sum_{t=1}^{T} z_t z_t'$. The variance $\sigma^2$ can be estimated by $\hat\sigma^2 = \frac{1}{T} S(\hat{k})$. Similar to the
result of Corollary 1, it is not difficult to show that $\hat\sigma^2$ is a root-$T$ consistent estimator
of $\sigma^2$. For serially correlated errors, one also needs to estimate the matrix $\Omega$, and
the methods of Newey and West (1987) and Andrews (1991) can be used. For
trending regressors, $\lambda_T^2 = \delta_T' g(\tau) g(\tau)' \delta_T$. The matrix $g(\tau) g(\tau)'$ can be estimated
by $g(\hat{k}/T) g(\hat{k}/T)'$, which only uses the $\hat{k}$-th observation. All quantities necessary
for constructing confidence intervals are easy to obtain. A $100(1 - \alpha)\%$ confidence
interval is given by

\[ [\hat{k} - \hat\sigma^2 c_{\alpha/2}/\hat\lambda_T^2,\ \hat{k} + \hat\sigma^2 c_{\alpha/2}/\hat\lambda_T^2], \]

where $c_{\alpha/2}$ is the upper $\alpha/2$ percentile (that is, the $1 - \alpha/2$ quantile) of the distribution of $V^*$.

Because the estimated change point $\hat{k}$ is close to $k_0$, one can estimate $\lambda_T^2$ in either
case by

\[ \hat\lambda_T^2 = \hat\delta_T' \left( \frac{1}{2m + 1} \sum_{t=\hat{k}-m}^{\hat{k}+m} z_t z_t' \right) \hat\delta_T \]

for some $m > 0$. In the case of a trending regressor such that $z_t = g(t/T)$, it is not
difficult to show that $\frac{1}{2m+1}\sum_{t=\hat{k}-m}^{\hat{k}+m} z_t z_t' \to g(\tau) g(\tau)'$ for each fixed $m$. This is also
true for $m$ growing unbounded but satisfying $m/T \to 0$, for example, $m = \sqrt{T}$. If $z_t$
contains no trending regressors, then $\frac{1}{2m+1}\sum_{t=\hat{k}-m}^{\hat{k}+m} z_t z_t'$ should be close to $Q$ provided
$m$ is relatively large. Finally, the confidence intervals for the regression parameters $\beta$
and $\delta$ are constructed in the usual way.
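
The steps above can be sketched in code. The block below is an illustrative sketch under stated assumptions, not the authors' implementation: it evaluates the distribution function (35), inverts it by bisection to obtain the critical value, forms the local-window estimate of $\lambda_T^2$ for a scalar regressor $z_t$, and assembles the interval. All function names are hypothetical, the restriction to scalar $z_t$ and $\hat\delta_T$ is a simplification, and truncating the window at the sample boundaries is a choice made here.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def cdf_V(x):
    """Distribution function (35) of argmax_v {W(v) - |v|/2}."""
    if x < 0:
        return 1.0 - cdf_V(-x)
    r = math.sqrt(x)
    return (1.0
            + r / math.sqrt(2.0 * math.pi) * math.exp(-x / 8.0)
            - 0.5 * (x + 5.0) * norm_cdf(-r / 2.0)
            + 1.5 * math.exp(x) * norm_cdf(-1.5 * r))


def quantile_V(p, upper=200.0, tol=1e-10):
    """Quantile of the distribution in (35), found by bisection on [0, upper]."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if p < 0.5:
        return -quantile_V(1.0 - p, upper, tol)  # the distribution is symmetric
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf_V(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lambda2_hat(z, k_hat, delta_hat, m):
    """delta_hat^2 times the average of z_t^2 over t = k_hat-m, ..., k_hat+m
    (1-based t, scalar z_t; the window is cut off at the sample ends)."""
    first = max(1, k_hat - m)
    last = min(len(z), k_hat + m)
    window = z[first - 1:last]
    return delta_hat ** 2 * sum(v * v for v in window) / len(window)


def change_point_ci(k_hat, sigma2_hat, lam2, alpha=0.05):
    """Interval k_hat -/+ sigma2_hat * c / lam2, with c the 1 - alpha/2 quantile."""
    c = quantile_V(1.0 - alpha / 2.0)
    half_width = sigma2_hat * c / lam2
    return (k_hat - half_width, k_hat + half_width)
```

In practice one would round the endpoints to integers, and for serially correlated errors the variance input would be replaced by an estimate of the long-run variance. Both choices are left to the user in this sketch.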

5      Concluding  Remarks 

We  have  considered  the  structural  change  problem  in  a  linear  regression  model  where 
part  or  all  of  the  coefficients  have  a  shift  occurring  at  an  unknown  time.  The  unknown 
shift  point  is  estimated  by  maximizing  a  sequence  of  Wald  statistics.  We  established 
the  T-consistency  for  the  estimated  shift  point.  In  the  case  of  partial  structural 
change,  the  unchanged  regression  parameters  are  estimated  with  the  whole  sample, 
whereas  the  shifted  parameters  are  estimated  with  subsamples.  With  the  presence 
of structural change, we have demonstrated that the sequence of Wald-type statistics
is well behaved in the sense that the sequence achieves its global maximum near
the  true  shift  point.  Thus  the  maximization  is  done  via  global  searching.  Confining 
the  search  for  a  maximum  in  the  middle  of  the  sequence  (ignoring  some  fractions  of 
the  sequence  in  the  beginning  or  at  the  end)  is  not  necessary.  We  also  noted  that 
for  a  linear  regression  with  least  squares  estimation  the  shift-point  estimator  can 
be  equivalently  obtained  by  maximizing  a  sequence  of  F-statistics,  LM-statistics,  or 
LR-statistics.  When  only  a  single  parameter  is  allowed  to  change,  the  estimator  can 
be obtained by maximizing a sequence of t-statistics in absolute value as well. In
addition, we studied the asymptotic distributions for the shift-point estimator as
well  as  the  regression-parameter  estimators.  These  results  enable  one  to  construct 
confidence  intervals.  Most  importantly,  the  asymptotic  results  provide  insight  on  how 
the  precision  of  the  shift-point  estimator  depends  on  the  regressors,  the  magnitude 
of  the  shift,  and  serial  correlations  in  the  data.  Our  results  hold  for  a  wide  range 
of  regressor  designs,  with  time  trend  and  stochastic  regressors  as  special  cases.  We 
also  permit  heterogeneous  and  dependent  disturbances. 

It will be interesting to extend the argument of this paper to the setup considered
by Andrews (1993), who employs Wald, LM and LR statistics to test for the existence
of a structural change in nonlinear models. These statistics are constructed
using instrumental variable estimation or more generally the Hansen-type GMM
estimation. By maximizing a sequence of these statistics, one also obtains a shift-point
estimator. An important unresolved topic is the consistency of the resulting estimator.
Much work is needed to make the analysis of this paper applicable to nonlinear
models and to models with nondifferentiable objective functions such as LAD. We
hope this paper will stimulate further research in this area.

A     Appendix

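Lemma A.1 The following two inequalities hold:

\begin{align}
(Z_0'MZ_0) - (Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) &\ge R'(X_\Delta'X_\Delta)(X_2'X_2)^{-1}(X_0'X_0)R, & k \le k_0, \tag{36} \\
(Z_0'MZ_0) - (Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) &\ge R'(X_\Delta'X_\Delta)(X'X - X_2'X_2)^{-1}(X'X - X_0'X_0)R, & k > k_0. \tag{37}
\end{align}

Proof of (36). Write $H = (X_2'X_2)^{-1} - (X'X)^{-1}$. For $k \le k_0$,

\begin{align}
&(Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0) = \tag{38} \\
&R'(X_0'X_0)H(X_2'X_2)R\,\{R'(X_2'X_2)H(X_2'X_2)R\}^{-1}R'(X_2'X_2)H(X_0'X_0)R. \tag{39}
\end{align}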
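
The passage from (38) to (39) rests on a few block identities that the text does not spell out. The following sketch assumes, consistently with the notation of the lemma, that $Z_j = X_jR$ and $M = I - X(X'X)^{-1}X'$, and that for $k \le k_0$ the nonzero rows of $X_0$ are among those of $X_2$, so that $X_0'X_2 = X_0'X = X_0'X_0$ and $X_2'X = X_2'X_2$.

```latex
\begin{aligned}
Z_0'MZ_2 &= R'\left[ X_0'X_2 - X_0'X (X'X)^{-1} X'X_2 \right] R \\
         &= R'(X_0'X_0)\left[ (X_2'X_2)^{-1} - (X'X)^{-1} \right](X_2'X_2) R
          = R'(X_0'X_0) H (X_2'X_2) R, \\
Z_2'MZ_2 &= R'\left[ X_2'X_2 - X_2'X_2 (X'X)^{-1} X_2'X_2 \right] R
          = R'(X_2'X_2) H (X_2'X_2) R .
\end{aligned}
```

Substituting these two expressions, together with the transpose of the first, into (38) gives (39).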

Let $A = H^{1/2}(X_2'X_2)R$. Since $I - A(A'A)^{-1}A'$ is a projection matrix, we have
$I - A(A'A)^{-1}A' \ge 0$. Multiplying this inequality by $R'(X_0'X_0)H^{1/2}$ from the left and
by $H^{1/2}(X_0'X_0)R$ from the right, we obtain

\[ R'(X_0'X_0)H(X_0'X_0)R - R'(X_0'X_0)H^{1/2}A(A'A)^{-1}A'H^{1/2}(X_0'X_0)R \ge 0. \]

The second term above is identical to (39). Thus it suffices to show

\[ Z_0'MZ_0 - R'(X_0'X_0)H(X_0'X_0)R \ge R'(X_\Delta'X_\Delta)(X_2'X_2)^{-1}(X_0'X_0)R. \tag{40} \]

In fact, equality holds in (40) because the left-hand side of (40) is

\begin{align*}
& R'(X_0'X_0)\left[(X_0'X_0)^{-1} - (X'X)^{-1} - H\right](X_0'X_0)R \\
&= R'(X_0'X_0)\left[(X_0'X_0)^{-1} - (X_2'X_2)^{-1}\right](X_0'X_0)R \\
&= R'(X_\Delta'X_\Delta)(X_2'X_2)^{-1}(X_0'X_0)R.
\end{align*}

The last equality follows from $X_2'X_2 = X_0'X_0 + X_\Delta'X_\Delta$. The proof of (37) is similar
and is omitted.

Lemma A.2 Under assumptions A2-A4, for large $C > 0$, $\gamma_{CT}$ is positive and
bounded away from zero, where

\[ \gamma_{CT} = \inf_{|k - k_0| > C} \gamma(k) \]

and $\gamma(k)$ is defined in (11). That is, $\liminf_{T\to\infty} \gamma_{CT} = \gamma_C > 0$ in probability. Also
notice that for any number $\eta > 0$,

\[ \gamma_T = \inf_{|k - k_0| > T\eta} \gamma(k) \ge \inf_{|k - k_0| > C} \gamma(k) \]

because $T\eta > C$ for large $T$.

Proof: For $k \le k_0$, by Lemma A.1 part (i),

\[ \gamma(k) \ge \delta' R' \frac{X_\Delta'X_\Delta}{k_0 - k}(X_2'X_2)^{-1}(X_0'X_0)R\delta. \tag{41} \]

If $X_\Delta'X_\Delta$ is positive definite, then the RHS of (36) is positive definite since it can
be written as $R'[(X_0'X_0)^{-1} + (X_\Delta'X_\Delta)^{-1}]^{-1}R$. Thus if $X_\Delta'X_\Delta > 0$, the LHS of
(41) is positive. Note that $X_\Delta'X_\Delta$ is in fact positive definite when $k$ and $k_0$ are
far apart (i.e., when $X_\Delta$ contains enough nonzero elements). Also when $k \le k_0$,
$\frac{1}{k_0 - k}X_\Delta'X_\Delta = \frac{1}{k_0 - k}\sum_{t=k+1}^{k_0} x_t x_t'$ is bounded away from zero, by A2 and A3, for all $k$
such that $k_0 - k$ is large. In addition,

\[ (X_2'X_2)^{-1}(X_0'X_0) = \frac{T - k_0}{T - k}\left(\frac{X_2'X_2}{T - k}\right)^{-1}\frac{X_0'X_0}{T - k_0} \]

is bounded away from zero for all $k \le k_0$. Thus we can choose $C$ sufficiently large
so that

\[ \inf_{k_0 - k > C}\left\{\delta' R' \frac{X_\Delta'X_\Delta}{k_0 - k}(X_2'X_2)^{-1}(X_0'X_0)R\delta\right\} \]

is bounded away from zero. The case for $k > k_0$ can be proved similarly.

Lemma A.3 Under assumption A6, there exists $L > 0$ such that for any $A > 0$,

\[ P\left( \sup_{k \ge m} \frac{1}{k}\left\| \sum_{t=1}^{k} z_t \varepsilon_t \right\| > A \right) \le \frac{L}{A^2 m}. \]

Proof: This is essentially proved by Serfling (1970) (Theorem 5.1). Details are
omitted. Note that for $z_t\varepsilon_t$ i.i.d. or martingale differences, the lemma is the Hajek and
Renyi (1955) inequality. An extension of the Hajek and Renyi inequality to a linear
process of martingale differences is given in Bai (1993a).

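Proof of Corollary 1. Let us use $\hat{Z}_0$ to denote $Z_0$ when $k_0$ is replaced by $\hat{k}$. Then
the estimators $\hat\beta$ and $\hat\delta$ are obtained by regressing $y$ on $X$ and $\hat{Z}_0$. Equation (3) can
be written as

\[ y = X\beta + \hat{Z}_0\delta + \varepsilon^* \]

with $\varepsilon^* = \varepsilon + (Z_0 - \hat{Z}_0)\delta$. Proceeding in the usual way, we have

\[ \sqrt{T}\begin{pmatrix}\hat\beta - \beta \\ \hat\delta - \delta\end{pmatrix} = \left[\frac{1}{T}\begin{pmatrix} X'X & X'\hat{Z}_0 \\ \hat{Z}_0'X & \hat{Z}_0'\hat{Z}_0 \end{pmatrix}\right]^{-1} \frac{1}{\sqrt{T}}\begin{pmatrix} X'\varepsilon + X'(Z_0 - \hat{Z}_0)\delta \\ \hat{Z}_0'\varepsilon + \hat{Z}_0'(Z_0 - \hat{Z}_0)\delta \end{pmatrix}. \]

All we have to show is that the limit of the right-hand side is the same as the limit
when $\hat{Z}_0 = Z_0$. Let us show

\[ \operatorname{plim}\frac{1}{\sqrt{T}}X'(Z_0 - \hat{Z}_0)\delta = 0. \tag{42} \]

Supposing $\hat{k} \le k_0$ (the case of $\hat{k} > k_0$ is similar), we have

\[ \frac{1}{\sqrt{T}}X'(Z_0 - \hat{Z}_0) = -\frac{1}{\sqrt{T}}\sum_{t=\hat{k}+1}^{k_0} x_t z_t'. \tag{43} \]

Since the sum only involves $k_0 - \hat{k}$ terms, and $k_0 - \hat{k} = O_p(1)$ by Proposition 3,
(43) converges to zero. The zero limit for (43) certainly implies $\operatorname{plim}\frac{1}{T}X'\hat{Z}_0 =
\operatorname{plim}\frac{1}{T}X'Z_0$. All other terms involving $\hat{Z}_0$ can be treated similarly and the derivations
are omitted.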

Remarks: Corollary 1 holds not only for fixed $\delta$, but also when $\|\delta_T\| \to 0$
and $\sqrt{T}\|\delta_T\| \to \infty$, as in the setup of Section 4. In this case, $\hat{k} = k_0 + \|\delta_T\|^{-2}O_p(1)$,
and (42) can be proved as follows. Note:

\[ \left\|\frac{1}{\sqrt{T}}X'(Z_0 - \hat{Z}_0)\delta_T\right\| \le \frac{1}{\sqrt{T}\|\delta_T\|}\left\|\sum_{t=\hat{k}+1}^{k_0} x_t z_t'\right\|\|\delta_T\|^2. \]

Since the sum involves about $\|\delta_T\|^{-2}$ terms, $\|\sum_{t=\hat{k}+1}^{k_0} x_t z_t'\|\,\|\delta_T\|^2 = O_p(1)$. By
assumption, $(\sqrt{T}\|\delta_T\|)^{-1} \to 0$, and so the desired result follows.

Lemma A.4 The following identity holds:

\[ \delta_T'\{(Z_0'MZ_0) - (Z_0'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_0)\}\delta_T = \delta_T'(Z_\Delta'MZ_\Delta)\delta_T - \delta_T'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_\Delta)\delta_T. \]

The proof follows simply from the fact that $Z_0'MZ_2 = Z_2'MZ_2 \pm Z_\Delta'MZ_2$.
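
One way to see the identity is sketched below. The auxiliary matrix $P_2 = MZ_2(Z_2'MZ_2)^{-1}Z_2'M$ is notation introduced only for this illustration. Because $M$ is symmetric and idempotent, $(M - P_2)Z_2 = MZ_2 - MZ_2(Z_2'MZ_2)^{-1}Z_2'MZ_2 = 0$. Writing $Z_0 = Z_2 \mp Z_\Delta$, the terms involving $Z_2$ drop out:

```latex
\begin{aligned}
Z_0'MZ_0 - Z_0'MZ_2 (Z_2'MZ_2)^{-1} Z_2'MZ_0
  &= Z_0'(M - P_2) Z_0 \\
  &= (Z_2 \mp Z_\Delta)'(M - P_2)(Z_2 \mp Z_\Delta) \\
  &= Z_\Delta'(M - P_2) Z_\Delta \\
  &= Z_\Delta'MZ_\Delta - Z_\Delta'MZ_2 (Z_2'MZ_2)^{-1} Z_2'MZ_\Delta .
\end{aligned}
```

Pre- and post-multiplying by $\delta_T'$ and $\delta_T$ then gives the statement of the lemma.
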
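
Lemma A.5 Under assumptions A1-A6,

(i) $\|T^{-1/2}\delta_T'(Z_\Delta'X)\| = o_p(1)$

(ii) $\|T^{-1/2}\delta_T'(Z_\Delta'MZ_2)\| = o_p(1)$

(iii) $\|T^{-1/2}\varepsilon'MZ_\Delta\| = o_p(1)$

(iv) $\varepsilon'MZ_2(Z_2'MZ_2)^{-1}Z_2'M\varepsilon - \varepsilon'MZ_0(Z_0'MZ_0)^{-1}Z_0'M\varepsilon = o_p(1)$

where the $o_p(1)$ is uniform on $K_T(V)$.

Proof of (i).

\[ \|T^{-1/2}\delta_T'(Z_\Delta'X)\| = \left\|T^{-1/2}\delta_T'\sum z_t x_t'\right\| \le \frac{\|\delta_T\|\,|k_0 - k|}{\sqrt{T}}\,O_p(1) = \frac{1}{\sqrt{T}\|\delta_T\|}\,O(1)\cdot O_p(1), \]

where the sum is over the $|k_0 - k|$ observations that enter $Z_\Delta$.

Proof of (ii).

\[ \frac{1}{\sqrt{T}}\delta_T'Z_\Delta'MZ_2 = \frac{1}{\sqrt{T}}\delta_T'Z_\Delta'Z_2 - \frac{1}{\sqrt{T}}\delta_T'(Z_\Delta'X)\left(\frac{X'X}{T}\right)^{-1}\frac{X'Z_2}{T}. \]

By part (i), the second term is $o_p(1)O_p(1) = o_p(1)$. That the first term is $o_p(1)$ can
be proved in exactly the same way as in part (i). (Note that $Z_\Delta'Z_2$ equals $Z_\Delta'Z_\Delta$ for
$k \le k_0$ and equals zero for $k > k_0$.)

Proof of (iii).

\[ \frac{1}{\sqrt{T}}\varepsilon'MZ_\Delta = \frac{1}{\sqrt{T}}\varepsilon'Z_\Delta - \frac{\varepsilon'X}{\sqrt{T}}\left(\frac{X'X}{T}\right)^{-1}\frac{X'Z_\Delta}{T}. \]

Because $\varepsilon'Z_\Delta$ consists of only about $\lambda_T^{-2}$ observations, the functional central limit theorem
implies that $\lambda_T\|\varepsilon'Z_\Delta\| = O_p(1)$. This together with $\sqrt{T}\lambda_T \to \infty$ implies that the first
term on the right is $o_p(1)$. Consider the second term. Because $X'Z_\Delta/(k_0 - k) = O_p(1)$
and $|k_0 - k|/T = o(1)$, it follows that $(X'Z_\Delta)/T = o_p(1)$, with the $O_p(1)$ and $o(1)$ both
being uniform on $K_T(V)$. Finally, from $\varepsilon'X/\sqrt{T} = O_p(1)$ and $(X'X/T)^{-1} = O_p(1)$,
the second term is $O_p(1)\cdot o_p(1) = o_p(1)$.

Proof of (iv). Use $Z_2 = Z_0 \pm Z_\Delta$ to obtain

\[ \varepsilon'MZ_2(Z_2'MZ_2)^{-1}Z_2'M\varepsilon = \varepsilon'MZ_0(Z_2'MZ_2)^{-1}Z_0'M\varepsilon \pm \varepsilon'MZ_\Delta(Z_2'MZ_2)^{-1}Z_2'M\varepsilon \pm \varepsilon'MZ_0(Z_2'MZ_2)^{-1}Z_\Delta'M\varepsilon. \]

The result of (iii) implies that the last two terms above are $o_p(1)O_p(1) = o_p(1)$. Thus
the left-hand side of (iv) can be written as

\[ \frac{\varepsilon'MZ_0}{\sqrt{T}}\left[\left(\frac{Z_2'MZ_2}{T}\right)^{-1} - \left(\frac{Z_0'MZ_0}{T}\right)^{-1}\right]\frac{Z_0'M\varepsilon}{\sqrt{T}} + o_p(1). \]

Because the two matrices inside the bracket converge to the same limit on $K_T(V)$,
(iv) is proved.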

Proof of Proposition 4: By the definition of $V_T(k) - V_T(k_0)$ [see (7) and (8)] and
Lemma A.4,

\begin{align}
& V_T(k) - V_T(k_0) \tag{44} \\
&= -\delta_T'(Z_\Delta'MZ_\Delta)\delta_T + \delta_T'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_\Delta)\delta_T + h(k, \delta_T, \varepsilon). \tag{45}
\end{align}

From the definition of $M$, the first term of (45) involves

\[ \delta_T'(Z_\Delta'MZ_\Delta)\delta_T = \delta_T'Z_\Delta'Z_\Delta\delta_T - \frac{\delta_T'Z_\Delta'X}{\sqrt{T}}\left(\frac{X'X}{T}\right)^{-1}\frac{X'Z_\Delta\delta_T}{\sqrt{T}}. \]

By Lemma A.5 (i), the second term on the right is $o_p(1)$, leaving $\delta_T'Z_\Delta'Z_\Delta\delta_T$. Now
the second term of (45) can be written as, upon appropriate scaling,

\begin{align}
& \delta_T'(Z_\Delta'MZ_2)(Z_2'MZ_2)^{-1}(Z_2'MZ_\Delta)\delta_T \notag \\
&= T^{-1/2}\delta_T'(Z_\Delta'MZ_2)(Z_2'MZ_2/T)^{-1}(Z_2'MZ_\Delta)\delta_T T^{-1/2}. \tag{46}
\end{align}

On $K_T(V)$, $(Z_2'MZ_2/T)^{-1} = O_p(1)$ and by Lemma A.5 (ii), (46) is $o_p(1)$. Next,
consider the last term of (45), $h(k, \delta_T, \varepsilon)$ [see (9) and (10) for its definition]. Lemma
A.5 (iv) says (10) is $o_p(1)$. Using $(Z_0'MZ_2) = (Z_2'MZ_2) \pm Z_\Delta'MZ_2$ together with
Lemma A.5, it is easy to show

\[ 2\delta_T'(Z_0'MZ_2)(Z_2'MZ_2)^{-1}Z_2'M\varepsilon - 2\delta_T'Z_0'M\varepsilon = \pm 2\delta_T'Z_\Delta'\varepsilon + o_p(1). \]

Combining these results yields Proposition 4.